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Abstract. Tractable generalizations of the Gaussian distribution play an important role for 
the analysis of high-dimensional data. One very general super-class of Normal distributions is 
the class of ^-spherical distributions whose random variables can be represented as the product 
x = r ■ u of a uniformly distribution random variable u on the 1-level set of a positively 
homogeneous function v and arbitrary positive radial random variable r. Prominent subclasses 
of j/-sphcrical distributions are spherically symmetric distributions (y(x) = \\x\\2) which have 
been further generalized to the class of L p -spherically symmetric distributions {v(x) = ||a:||p). 
Both of these classes contain the Gaussian as a special case. In general, however, ^-spherical 
distributions are computationally intractable since, for instance, the normalization constant or 
fast sampling algorithms are unknown for an arbitrary v. In this paper we introduce a new 
subclass of z/-spherical distributions by choosing v to be a nested cascade of Lp-norms. This 
class, which we consequently call Lp-nested symmetric distributions is still computationally 
tractable, but includes all the aforementioned subclasses as a special case. We derive a general 
expression for Lp-nested symmetric distributions as well as the uniform distribution on the 
Lp-nested unit sphere, including an explicit expression for the normalization constant. We 
state several general properties of Lp-nested symmetric distributions, investigate its marginals, 
maximum likelihood fitting and discuss its tight links to well known machine learning methods 
such as Independent Component Analysis (ICA), Independent Subspace Analysis (ISA) and 
mixed norm regularizers. Finally, we derive a fast and exact sampling algorithm for arbitrary 
Lp-nested symmetric distributions, and introduce the Nested Radial Factorization algorithm 
(NRF), which is a form of non-linear ICA that transforms any linearly mixed, non-factorial 
Lp-nested source into statistically independent signals. 

parametric density model, symmetric distribution, ^-spherically symmetric distributions, non- 
linear independent component analysis, independent subspace analysis, robust Bayesian inference, 
mixed norm density model, uniform distributions on mixed norm spheres, nested radial factoriza- 
tion 

1. Introduction 

High-dimensional data analysis virtually always starts with the measurement of first and second- 
order moments that are sufficient to fit a multivariate Gaussian distribution, the maximum entropy 
distribution under these constraints. Natural data, however, often exhibit significant deviations 
from a Gaussian distribution. In order to model these higher-order correlations, it is necessary 
to have more flexible distributions available. Therefore, it is an important challenge to find 
generalizations of the Gaussian distribution which are, on the one hand, more flexible but, on the 
other hand, still exhibit enough structure to be computationally and analytically tractable. In 
particular, probability models with an explicit normalization constant are desirable because they 
make direct model comparison possible by comparing the likelihood of held out test samples for 
different models. Additionally, such models often allow for a direct optimization of the likelihood. 

One way of imposing structure on probability distributions is to fix the general form of the 
iso-density contour lines. This approach was taken by Fernandez et al. [1995]. They modeled the 
contour lines by the level sets of a positively homogeneous function of degree one, i.e. functions v 
that fulfill v(a-x) = a-v(x) for x e IR" and a E \Rq . The resulting class of z^-spherical distributions 
have the general form p(x) = p{v{x)) for an appropriate p which causes p(x) to integrate to one. 
Since the only access of p to x is via v one can show that, for a fixed v, those distributions are 
generated by a univariate radial distribution. In other words, ^-spherically distributed random 
variables can be represented as a product of two independent random variables: one positive radial 
variable and another variable which is uniform on the 1-level set of v. This property makes this 
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class of distributions easy to fit to data since the maximum likelihood procedure can be carried 
out on the univariate radial distribution instead of the joint density. Unfortunately deriving 
the normalization constant for the joint distribution in the general case is intractable because it 
depends on the surface area of those level sets which can usually not be computed analytically. 

Known tractable subclasses of i^-spherical distributions are the Gaussian, elliptically contoured, 
and Lp-spherical distributions. The Gaussian is a special case of elliptically contoured distribu- 
tions. After centering and whitening x := C~ 1,/2 (s — E[s]) a Gaussian distribution is spherically 
symmetric and the squared L2-norm ||x||2 = x\ + • • • + x 2 n of the samples follow a ^-distribution 
(i.e. the radial distribution is a ^-distribution). Elliptically contoured distributions other than 
the Gaussian are obtained by using a radial distribution different from the x-distribution [Fang 
ct al., 1990; Kelker, 1970]. 

The extension from L 2 - to L p -spherically symmetric distributions is based on replacing the 
L2-norm by the L p -norm 



in the definition of the density. That is, the density of L p -spherical distributions can always 
be written in the form p(x) = p{\\x\\ p ). Those distributions have been studied by Osiewalski 
and Steel [1993] and Gupta and Song [1997]. We will adopt the naming convention of Gupta 
and Song [1997] and call ||x|| p an L p -norm even though the triangle inequality only holds for 
p > 1. Lp-sphcrically symmetric distribution with p ^ 2 are no longer invariant with respect 
to rotations (transformations from SO(n)). Instead, they are only invariant under permutations 
of the coordinate axes. In some cases, it may not be too restrictive to assume permutation or 
even rotational symmetry for the data. In other cases, such symmetry assumptions might not be 
justified and let the model miss important regularities. 

Here, we present a generalization of the class of L p -spherically symmetric distribution within 
the class of ^-spherical distributions that makes weaker assumptions about the symmetries in the 
data but still is analytically tractable. Instead of using a single L p -norm to define the contour of 
the density, we use a nested cascade of L p -norms where an L p -norm is computed over groups of 
L p -norms over groups of L p -norms each of which having a possibly different p. Due to this 
nested structure we call this new class of distributions L p -nested symmetric distributions. The 
nested combination of L p -norms preserves positive homogeneity but does not require permutation 
invariance anymore. While L p -nested distributions are still invariant under reflections of the 
coordinate axes, permutation symmetry only holds within the subspaces of the L p -norms at the 
bottom of the cascade. As demonstrated in Sinz et al. [2009b], one possible application domain of 
L p -nested symmetric distributions are patches of natural images. In the current paper, we would 
like to present a formal treatment of this class of distributions. We ask readers interested in the 
application of this distributions to natural images to refer to Sinz et al. [2009b] . 



We demonstrate below that the construction of the nested L p -norm cascade still bears enough 
structure to compute the Jacobian of polar-like coordinates similar to those of Song and Gupta 
[1997] and Gupta and Song [1997]. With this Jacobian at hand it is possible to compute the 
univariate radial distribution for an arbitrary i p -nested density and to define the uniform distri- 
bution on the L p -nested unit sphere IL^ = {x e IR™|i/(a;) = 1}. Furthermore, we compute the 
surface area of the I/ p -nested unit sphere and, therefore, the general normalization constant for 
L p -nested distributions. By deriving these general relations for the class of L p -nested distributions 
we have determined a new class of tractable ^-spherical distributions which is so far the only one 
containing the Gaussian, elliptically contoured, and L p -spherical distributions as special cases. 

-L p -sphcrically symmetric distributions have been used in various contexts in statistics and 
machine learning. Many results carry over to L p -nested symmetric distributions which allow a 
wider application range. Osiewalski and Steel [1993] showed that the posterior on the location 
of a i p -sphcrically symmetric distributions together with an improper Jeffrey's prior on the scale 
does not depend on the particular type of L p -spherically symmetric distribution used. Below, 
we show that this results carries over to L p -nested symmetric distributions. This means that 
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we can robustly determine the location parameter by Bayesian inference for a very large class of 
distributions. 

A large class of machine learning algorithms can be written as an optimization problem on 
the sum of a regularizer and a loss functions. For certain regularizers and loss functions, like 
the sparse L\ regularizer and the mean squared loss, the optimization problem can be seen as 
the Maximum A Posteriori (MAP) estimate of a stochastic model in which the prior and the 
likelihood arc the negative exponentiated regularizer and loss terms. Since p{x) cx exp(— is 
an Lp-spherically symmetric model, regularizers which can be written in terms of a norm have a 
tight link to Lp-spherically symmetric distributions. In an analogous way, L p -nested distributions 
exhibit a tight link to mixed-norm regularizers which have recently gained increasing interest in 
the machine learning community [see e.g. Kowalski et al., 2008; Yuan and Lin, 2006; Zhao et al., 
2008]. Lp-nested symmetric distributions can be used for a Bayesian treatment of mixed- norm 
regularized algorithms. Furthermore, they can be used to understand the prior assumptions made 
by such regularizers. Below we discuss an implicit dependence assumptions between the regularized 
variables that follows from the theory of L p -nested symmetric distributions. 

Finally, the only factorial Lp-spherically symmetric distribution [Sinz et al., 2009a], the p- 
gencralized Normal distribution, has been used as an ICA model in which the marginals follow 
an exponential power distribution. This class of ICA is particularly suited for natural signals 
like images and sounds [Lee and Lewicki, 2000; Lewicki, 2002; Zhang et al., 2004]. Interestingly, 
Lp-spherically symmetric distributions other than the p-generalized Normal give rise to a non- 
linear ICA algorithm called Radial Gaussianization for p = 2 [Lyu and Simoncelli, 2009] or Radial 
Factorization for arbitrary p [Sinz and Bethge, 2009] . As discussed below, L p -nested distributions 
are a natural extension of the linear Lp-spherically symmetric ICA algorithm to ISA, and give rise 
to a more general non-linear ICA algorithm in the spirit of Radial Factorization. 

The remaining part of the paper is structured as follows: in Section 2 we define polar-like 
coordinates for Lp-nested symmetrically distributed random variables and present an analytical 
expression for the determinant of the Jacobian for this coordinate transformation. Using this 
expression, we define the uniform distribution on the Lp-nested unit sphere and the class of Lp- 
nested symmetric distributions for an arbitrary L p -nested function in Section 3. In Section 4 we 
derive an analytical form of Lp-nested symmetric distributions when marginalizing out lower levels 
of the Lp-nested cascade and demonstrate that marginals of L p -nested symmetric distributions 
are not necessarily Lp-nested. Additionally, we demonstrate that the only factorial L p -nested 
symmetric distribution is necessarily Lp-spherical and discuss the implications of this result for 
mixed norm regularizers. In Section 5.1 we propose an algorithm for fitting arbitrary Lp-nested 
models and derive a sampling scheme for arbitrary Lp-nested symmetric distributions. In Section 
6 we generalize a result by Osiewalski and Steel [1993] on robust Bayesian inference on the location 
parameter to L p -nested symmetric distribution. In Section 7 we discuss the relationship of Lp- 
nested symmetric distributions to ICA, ISA and their possible role as prior on hidden variable 
in over-complete linear models. Finally, we derive a non-linear ICA algorithm for linearly mixed 
non- factorial L p -nested sources in Section 8 which we call Nested Radial Factorization (NRF). 

2. Lp-NESTED FUNCTIONS, COORDINATE TRANSFORMATION AND JACOBIAN 

Consider the function 



with p$,p\ G IR + . This function is obviously a cascade of two L p -norms and is thus positively 
homogeneous of degree one. Figure 1(a) shows this function visualized as a tree. Naturally, any 
tree like the ones in Figure 1 corresponds to a function of the kind of equation (1). In general, 
the n leaves of the tree correspond to the n coefficients of the vector x 6 IR™ and each inner node 
computes the L p -norm of its children using its specific p. We call the class of functions which 
is generated in that way L p -nested and the corresponding distributions, that are symmetric or 
invariant with respect to it, L p -nested symmetric distributions. 



(1) 
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(a) Equation (1) as tree. (b) Equation (1) as tree in multi-index notation. 

Figure 1. Equation (1) visualized as a tree with two different naming conven- 
tions. Figure (a) shows the tree where the nodes are labeled with the coefficients 
of x £ IR™. Figure (b) shows the same tree in multi-index notation where the 
multi-index of a node describes the path from the root node to that node in the 
tree. The leaves Ui,i>2,i and V2.2 still correspond to x±,X2 and X3, respectively, 
but have been renamed to the multi-index notation used in this article. 



/(•) =/«(•) 


Lp-nested function 




Multi-index denoting a node in the tree. The single indices describe 
the path from the root node to the respective node /. 


XI 


All entries in x that correspond to the leaves in the subtree under 
the node I. 


x T 


All entries in x that are not leaves in the subtree under 
the node /. 


//(•) 


Lp-nested function corresponding to the subtree under the node /. 


V(tl 


Function value at the root node 


VI 


Function value at an arbitrary node with multi-index /. 


ll 


The number of direct children of a node /. 


71/ 


The number of leaves in the subtree under the node i\ 


V I,l:£l 


Vector with the function values at the direct children of a node /. 



Table 1 . Summary of the notation used for L p -ncstcd functions in this article. 



Lp-nested functions are much more flexible in creating different shapes of level sets than single 
Lp-norms. Those level sets become the iso-density contours in the family of Lp-nested symmetric 
distributions. Figure 2 shows a variety of contours generated by the simplest non-trivial L p -nested 
function shown in equation (f). The shapes show the unit spheres for all possible combinations of 
PiDiPi € {0.5, 1,2, 10}. On the diagonal, and p\ are equal and therefore constitute Lp-norms. 
The corresponding distributions are members of the Lp-spherically symmetric class. 

In order to make general statements about general Lp-nested functions, we introduce a notation 
that is suitable for the tree structure of L p -nested functions. As we will heavily use that notation in 
the remainder of the paper, we would like to emphasize the importance of the following paragraphs. 
We will illustrate the notation with an example below. Additionally, Figure 1 and Table 1 can be 
used for reference. 

We use multi-indices to denote the different nodes of the tree corresponding to an Lp-nested 
function /. The function / = f$ itself computes the value v% at the root node (see Figure 1). 
Those values are denoted by variables v. The functions corresponding to its children are denoted 
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pi =0.5 pi = 1 pi = 2 Pi = 10 




FIGURE 2. Variety of contours created by the L p -nested function of equation (1) 
for all combinations of pq^pi £ {0.5, 1, 2, 10}. 

by A, ...,fe 9 , i-e. /(•) = /fl(-) = ||(/i(-), -, A (-))IU- We always use the letter l T indexed by the 
node's multi-index to denote the total number of direct children of that node. The functions of 
the children of the i th child of the root node are denoted by /^i, /^^ and so on. In this manner, 
an index is added for denoting the children of a particular node in the tree and each multi-index 
denotes the path to the respective node in the tree. For the sake of compact notation, we use upper 
case letters to denote a single multi-index / = i\, ...,ie- The range of the single indices and the 
length of the multi-index should be clear from the context. A concatenation /, k of a multi-index 
I with a single index k corresponds to adding k to the index tuple, i.e. I,k = i%, i m , k. We use 
the convention that 7, = /. Those coefficients of the vector x that correspond to leaves of the 
subtree under a node with the index I are denoted by X[. The complement of those coefficients, 
i.e. the ones that are not in the subtree under the node I, are denoted by xj. The number of 
leaves in a subtree under a node I is denoted by nj. If J denotes a leaf then tij = 1. 
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The Lp-nested function associated with the subtree under a node / is denoted by 

fiixi) = \\(fi,i(xi tl ), fi t t [ (xi t e I )) T \\ Pl . 

Just like for the root node, we use the variable vi to denote the function value vi — of a 

subtree /. A vector with the function values of the children of / is denoted with bold font vi 1 .^ I 
where the colon indicates that we mean the vector of the function values of the lj children of node 
/: 

fi(xi) = \\(fi,i(xi } i), ...J Lll {x Lll )) T \\ pi 

= \\(vi,l,-,V I4l ) T \\ pi = ||t>j,l:^|| pi . 

Note that we can assign an arbitrary p to leaf nodes since p for single variables always cancel. 
For that reason we can choose an arbitrary p for convenience and fix its value to p — 1. Figure 
1(b) shows the multi-index notation for our example of equation (1). 

To illustrate the notation: Let / — ...,i<2 be the multi-index of a node in the tree. i±, ■■■,id 
describes the path to that node, i.e. the respective node is the i*] 1 child of the i t J l _ 1 child of the 
*d-2 child of the ... of the i[ h child of the root node. Assume that the leaves in the subtree below 
the node / cover the vector entries x 2 , x w . Then Xj = (x 2 , Xiq), xj = Xn, £12, •■■), 
and nj = 9. Assume that node / has l\ = 2 children. Those would be denoted by /, 1 and /, 2. 
The function realized by node / would be denoted by // and only acts on xj. The value of the 
function would be fi(xi) — vi and the vector containing the values of the children of / would be 

VJ ,1:2 = (w/,l,W/, 2 ) T = (.fls{xis),fl,2{xi, 2 )) T - 

We now introduce a coordinate representation that is especially tailored to L p -nested symmetri- 
cally distributed variables: One of the most important consequence of the positive homogeneity of 
/ is that it can be used to "normalize" vectors and, by that property, create a polar like coordinate 
representation of a vector x. Such polar-like coordinates generalize the coordinate representation 
for Lp-norms by Gupta and Song [1997]. 

Definition 1 (Polar-like Coordinates) . We define the following polar-like coordinates for a vector 
x e IR": 

u i = ~F7~\ for * = 1) ••! n _ 1 
f(x) 

r = f(x). 

The inverse coordinate transformation is given by 

Xi — rui for i = 1 , . . ., n — 1 

where A„ = sgnx„ and u n = 

Note that u n is not part of the coordinate representation since normalization with 1/ f(x) 
decreases the degrees of freedom u by one, i.e. u n can always be computed from all other by 
solving /(u) = f(x/f(x)) = 1 for u n . We only use the term u n for notational simplicity. With 
a slight abuse of notation, we will use u to denote the normalized vector x/ f(x) or only its first 
n — 1 components. The exact meaning should always be clear from the context. 

The definition of the coordinates is exactly the same as the one by Gupta and Song [1997] 
with the only difference that the L p -norm is replaced by an L p -nested function. Just as in the 
case of Lp-spherical coordinates, it will turn out that the determinant of the Jacobian of the 
coordinate transformation does not depend on the value of A„ and can be computed analytically. 
The determinant is essential for deriving the uniform distribution on the unit L p -nested sphere 
IL/, i.e. the 1-level set of /. Apart from that, it can be used to compute the radial distribution for 
a given L p -nested distribution. We start by stating the general form of the determinant in terms 
of the partial derivatives §^7, Uk and r. Afterwards we demonstrate that those partial derivatives 
have a special form and that most of them cancel in Laplace's expansion of the determinant. 
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Lemma 1 (Determinant of the Jacobian). Let r and u be defined as in Definition 1. The general 
form of the determinant of the Jacobian J = {^jf^j of the inverse coordinate transformation for 
Hi = r and y.i = for i = 2, n, is given by 



(2) \detj\=r^ UV-^. Ufc 




Proof. The proof can be found in the Appendix A. □ 

The problematic part in equation (2) are the terms g^S which obviously involve extensive usage 
of the chain rule. Fortunately, most of them cancel when inserting them back into equation (2), 
leaving a comparably simple formula. The remaining part of this section is devoted to computing 
those terms and demonstrating how they vanish in the formula for the determinant. Before we state 
the general case we would like to demonstrate the basic mechanism through a simple example. We 
urge the reader to follow the next example as it illustrates all important ideas about the coordinate 
transformation and its Jacobian. 

Example 1 . Consider an L p -nested function very similar to our introductory example of equation 
(1): 

/(») = (dur + \ X2 n% + \ X3 \p^ . 

Setting u = and solving for 113 yields 

(3) /(u) = l^ u 3 = (i-(M Pl + M P1 )-)^ 

We would like to emphasize again, that u 3 is actually not part of the coordinate representation and 
only used for notational simplicity. By construction, U3 is always positive. This is no restriction 
since Lemma 2 shows that the determinant of the Jacobian does not depend on its sign. However, 
when computing the volume and the surface area of the L p -nested unit sphere it will become 
important since it introduces a factor of 2 to account for the fact that 1*3 (or u n in general) can 
in principle also attain negative values. 



Now, consider 

G 2 (u 2 ) = ^(us) 1 ^ = (l-(Kr- 
F^U!) = Mu^-pi = (\ Ul r + \u 2 



P0 

\u 2 



IP1 )P1 



!-p0 
P0 



P0-P1 



where the subindices of u, /, <?, G and F have to be read as multi-indices. The function gi 
computes the value of the node / from all other leaves that are not part of the subtree under / by 
fixing the value of the root node to one. 

6*2(1*2) and -Fi(ui) are terms that arise from applying the chain rule when computing the 
partial derivatives g^ 4 . Taking those partial derivatives can be thought of as pealing off layer by 
layer of Equation (3) via the chain rule. By doing so, we "move" on a path between ti 3 and U}.. 
Each application of the chain rule corresponds to one step up or down in the tree. First, we move 
upwards in the tree, starting from 113. This produces the G-terms. In this example, there is only 
one step upwards, but in general, there can be several, depending on the depth of u n in the tree. 
Each step up will produce one G-term. At some point, we will move downwards in the tree to 
reach u^. This will produce the i^-terms. While there are as many G-terms as upward steps, there 
is one term less when moving downwards. Therefore, in this example, there is one term 62(^2) 
which originates from using the chain rule upwards in the tree and one term ^(ui) from using it 
downwards. The indices correspond to the multi-indices of the respective nodes. 

Computing the derivative yields 

^ = -G 2 («2)^i(ui)A fe Kr- 1 . 



8 



FABIAN SINZ AND MATTHIAS BETHGE 



By inserting the results in equation (2) we obtain 
ij| = X)G 2 (u 2 )F 1 (u 1 )|u fc r+' 



U 3 

k=l 



= G 2 (u 5 ) F 1 (u 1 ) KP + 1 - F 1 (u 1 )F 1 (u 1 )- 1 (|mr + KP 



'•Hi 



k=l 
1 



G 2 (« 5 ) I F x ( Ul ) £ KP + 1 - *i(ui) £ KP 
G 2 (m 2 ). 



fc=i fe=i 



The example suggests that the terms from using the chain rule downwards in the tree cancel 
while the terms from using the chain rule upwards remain. The following proposition states that 
this is true in general. 

Proposition 1 (Determinant of the Jacobian). Let C be the set of multi-indices of the path from 
the leaf u n to the root node (excluding the root node) and let the terms G/^ (iij-jr ) recursively be 
defined as 



(4) GuAuftf = gxtMfx,?*' 1 '- 1 ' 1 = [a^tY 1 -pJiA^r 

where each of the functions gi.ij computes the value of the I th child of a node I as a function of 
its neighbors {1, 1), (I,£i — 1) and its parent I while fixing the value of the root node to one. 
This is equivalent to computing the value of the node I from all coefficients that are not leaves 
in the subtree under I. Then, the determinant of the Jacobian for an L p -nested function is given 
by 

det\J\=r n ' 1 l[G L (u z ). 

Proof. The proof can be found in the Appendix A. □ 

Let us illustrate the determinant with two examples: 
Example 2. Consider a normal L p -norm 

/(*) = (en p J 

which is obviously also an L p -nested function. Resolving the equation for the last coordinate of 

i 

the normalized vector u yields g n (^n) = u n = ( 1 — Y^7=i \ u i\ P ) P ■ Thus, the term G n (v,n) term is 



given by ^1 — \ u i\ p ^j " which yields a determinant of | det J\ = r n 1 ^1 — 

This is exactly the one derived by Gupta and Song [1997]. 

Example 3. Consider the introductory example 

/(*) = (n* + (\x 2 r+\x 3 n^)^ ■ 

Normalizing and resolving for the last coordinate yields 

u 3 = ((i-KD^ -Kp)" 
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and the terms 6*2(1x5) and G^^i^^pi) of the determinant | detj'l = r 2 G2{u^)G2 l 2{ u 22) ar e given 
by 

G 2 ( M5 ) = (i-Kr)^ 

Note the difference to Example f where X3 was at depth one in the tree while X3 is at depth two 
in the current case. For that reason, the determinant of the Jacobian in Example f only involved 
one G-term while it has two G-terms here. 



3. Lp-NESTED Symmetric and L p -Nested Uniform Distribution 



In this section, we define the L p -nested symmetric and the L p -nested uniform distribution and 
derive their partition functions. In particular, we derive the surface area of an arbitrary L p -nested 
unit sphere \if = {x € IR ra | f{x) = 1} corresponding to an L p -nested function /. By equation (5) 
of Fernandez et al. [1995] every i^-spherically symmetric and hence any L p -nested density has the 
form 



(5) 



p(x) 



where Sf is the surface area of \lf and g is a density on IR + . Thus, we need to compute the surface 
area of an arbitrary L p -nested unit sphere to obtain the partition function of equation (5). 

Proposition 2 (Volume and Surface of the L p -nested Sphere). Let f be an L p -nested function 
and let T be the set of all multi-indices denoting the inner nodes of the tree structure associated 
with f. The volume Vf(R) and the surface St(R) of the L p -nested sphere with radius R are given 
by 



(6) 
(7) 
(8) 
(9) 



V f (R) 



R n T 



1,-1 

lex Pi k=\ 



Ek 
i=i n i,k ni,k+i 



Pi 



Pi 



R n 2 n 



n 





ni.k 1 
Pi \ 







R n - i * n n 4=i n b 



S S {R) 



R n-l 2 n Yl 



ieiPi 

ni J =1 r 



it=i 

ni.k 



J2i=l n I,k Tll,k+1 



Pi 



Pi 



-i-1t 

lex Pf T 



where B[a,b] — ^[l^b] denotes the (3 -function. 
Proof. The proof can be found in the Appendix B. 



□ 



Inserting the surface area in equation 5, we obtain the general form of an Lp-nested symmetric 
distribution for any given radial density g. 

Corollary 1 (L p -nested Symmetric Distribution). Let f be an L p -nested function and g a density 
on \R + . The corresponding L p -nested symmetric distribution is given by 



p(x) 



(10) 



f(x)"-iS f (l) 

<?(/(*))_ 

-1 

lex 



2 n f(x) n - 



k=l 



J2i=i n i,k ni,k+i 



Pi 



Pi 
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The results of Fernandez et al. [1995] imply that for any z^-spherically symmetric distribution, 
the radial part is independent of the directional part, i.e. r is independent of u. The distribution 
of u is entirely determined by the choice of v, or by the L p -nested function / in our case. The 
distribution of r is determined by the radial density g. Together, an L p -nested symmetric distri- 
bution is determined by both, the L p -nested function / and the choice of g. From equation (10), 
we can see that its density function must be the inverse of the surface area of IL f times the radial 
density when transforming (5) into the coordinates of Definition 1 and separating r and u (the 
factor f(x) n ~ 1 — r cancels due to the determinant of the Jacobian). For that reason we call the 
distribution of u uniform on the L p -sphere ILj in analogy to Song and Gupta [1997]. Next, we 
state its form in terms of the coordinates u. 

Proposition 3 (L p -nested Uniform Distribution). Let f be an L p -nested function. Let C be set 
set of multi-indices on the path from the root node to the leaf corresponding to x n . The uniform 
distribution on the L p -nested unit sphere, i.e. the set IL y — {x £ IR"|/(a;) = 1} is given by the 
following density over U\, 
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Proof. Since the L p -nested sphere is a measurable and compact set, the density of the uniform 
distribution is simply one over the surface area of the unit L p -nested sphere. The surface 5/(1) 
is given by Proposition 2. Transforming g^x) m *° ^ ne coordinates of Definition 1 introduces 
the determinant of the Jacobian from Proposition 1 and an additional factor of 2 since the 
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-i) G IR™ have to account for both half-shells of the -Lp-nested unit sphere, i.e. to 



account for the fact that u n could have been be positive or negative. This yields the expression 
above. □ 

Example 4. Let us again demonstrate the proposition at the special case where / is an L p -norm 
f(x) = \\x\\ p = (J2i=i \ x i\ p ) p ■ Using Proposition 2, the surface area is given by 
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The factor G n {Un) is given by ^1 — J2*i=i P ( see the L p -norm example before), which, 

after including the factor 2, yields the uniform distribution on the L p -sphere as defined in Song 
and Gupta [1997] 

p n ~ 1 T 5 / n-1 \ ^ 



p(u) 



Example 5. As a second illustrative example, we consider the uniform density on the L p -nested 
unit ball, i.e. the set {x G IR™| f(x) < 1}, and derive its radial distribution g. The density 
of the uniform distribution on the unit L p -nested ball does not depend on x and is given by 
p{x) = 1/V/(1). Transforming the density into the polar-like coordinates with the determinant 
from Proposition 1 yields 
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After separating out the uniform distribution on the L p -nested unit sphere, we obtain the radial 
distribution 

g(r) = nr"^ 1 for < r < 1 
which is a /3-distribution with parameters n and 1. 
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The radial distribution from the preceding example is of great importance for our sampling 
scheme derived in Section 5.2. The idea behind it is the following: First, a sample from an "simple" 
Lp-nested distribution is drawn. Since the radial and the uniform component on the L p -nested 
unit sphere are statistically independent, we can get a sample from the uniform distribution on the 
Lp-nested unit sphere by simply normalizing the sample from the simple distribution. Afterwards 
we can multiply it with a radius drawn from the radial distribution of the Lp-nested distribution 
that we actually want to sample from. The role of the simple distribution will be played by the 
uniform distribution within the L p -nested unit ball. Sampling from it is basically done by applying 
the steps in proof of Proposition 2 backwards. We lay out the sampling scheme in more detail in 
Section 5.2. 

4. Marginals 

In this section we discuss two types of marginals: First, we demonstrate that, in contrast to 
Lp-spherically symmetric distributions, marginals of Lp-nested distributions are not necessarily 
Lp-nested again. The second type of marginals we discuss are obtained by collapsing all leaves of a 
subtree into the value of the subtree's root node. For that case we derive an analytical expression 
and show that the values of the root node's children follow a special kind of Dirichlct distribution. 

Gupta and Song [1997] show that marginals of L p -spherically symmetric distributions are again 
Lp-spherically symmetric. This does not hold, however, for L p -nested symmetric distributions. 
This can be shown by a simple counterexample. Consider the L p -nested function 
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The uniform distribution inside the L 
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p-nested ball corresponding to / is given by 



p(x) = 

The marginal p(xi,X3) is given by 
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This marginal is Lp-spherically symmetric. Since any L p -nested distribution in two dimensions 
must be L p -spherically symmetric it cannot be Lp-nested symmetric as well. Figure 3 shows a 
scatter plot of the marginal distribution. Besides the fact that the marginals are not contained 
in the family of L p -nested distributions, it is also hard to derive a general form for them. This is 
not surprising given that the general form of marginals for L p -spherically symmetric distributions 
involves an integral that cannot be solved analytically in general and is therefore not very useful 
in practice [Gupta and Song, 1997]. For that reason we cannot expect marginals of L p -nested 
symmetric distributions to have a simple form. 

In contrast to single marginals, it is possible to specify the joint distribution of leaves and 
inner nodes of an Lp-nested tree if all descendants of their inner nodes in question have been 
integrated out. For the simple function above (the same that has been used in Example 1), the 
joint distribution of x 3 and v\ = || (xi, £2) ||pi would be an example of such a marginal. Since 
marginalization affects the L p -nested tree vertically, we call this type of marginals layer marginals. 
In the following, we present their general form. 

From the form of a general Lp-nested function and the corresponding symmetric distribution, 
one might think that the layer marginals are L p -nested again. However, this is not the case since 
the distribution over the Lp-nested unit sphere would deviate from the uniform distribution in 
most cases if the distribution of its children was Lp-spherically symmetric. 

Proposition 4. Let f be an L p -nested function. Suppose we integrate out complete subtrees from 
the tree associated with f , that is we transform subtrees into radial times uniform variables and 
integrate out the latter. Let J be the set of multi-indices of those nodes that have become new 
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Figure 3. Marginals of L p -nested symmetric distributions are not necessarily 
Lp-nested symmetric: Figure (a) shows a scatter plot of the (x±, X2)-marginal of 
the counterexample in the text with p$ = 2 and pi = |. Figure (d) displays 
the corresponding L p -nested sphere, (b-c) show the univariate marginals for 
the scatter plot. Since any two-dimensional i p -nested distribution must be L p - 
spherical, the marginals should be identical. This is clearly not the case. Thus, 
(a) is not L p -nested symmetric. 



leaves, i.e. whose subtrees have been removed, and let nj be the number of leaves (in the original 
tree) in the subtree under the node J. Let € IR m denote those coefficients of x that are still 
part of that smaller tree and let v j denote the vector of inner nodes that became new leaves. The 
joint distribution of and Vj is given by 

(11) p(Xj,Vj)- 



S f (f(xj,vj)) 



IT 



Proof. The proof can be found in the Appendix C. 



□ 
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Equation (11) has an interesting special case when considering the joint distribution of the root 
node's children. 

Corollary 2. The children of the root node Vi-k — (v%, W£ e ) T follow the distribution 
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where m < l§ is the number of leaves directly attached to the root node. In particular, V\-t t can 
be written as the product RU , where R is the L p -nested radius and the single \Ui\ Pt! are Dirichlet 

distributed, i.e. (|£/i| PB , |J7a,| P0 ) ~ Dir 
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Proof. The joint distribution is simply the application of Proposition (4). Note that f{v\, Vu) 
Wvi-.iinWpt,- Applying the pointwise transformation s$ = |ui| Pe yields 
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The Corollary shows that the values fi(xj) at inner nodes /, in particular the ones directly 
below the root node, deviate considerably from L p -spherical symmetry. If they were L p -spherically 
symmetric, the \Ui\ p should follow a Dirichlet distribution with parameters cti = ^ as has been 
already shown by Song and Gupta [1997]. The Corollary is a generalization of their result. 

We can use the Corollary to prove an interesting fact about L p -nested symmetric distributions: 
The only factorial L p -nested symmetric distribution must be L p -spherically symmetric. 



Proposition 5 

L. 



Let x be L p -nested symmetric distributed with independent marginals. Then x is 



p-spherically symmetric distributed. In particular, x follows a p-generalized Normal distribution. 



Proof. The proof can be found in the Appendix D. 
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One immediate implication of Proposition 5 is that there is no factorial probability model corre 
sponding to mixed norm regularizers which have of the form ^2^_ 
form a partition of the dimensions 1, n [see e.g. Kowalski et al. 
et al., 2008]. Many machine learning algorithms are equivalent to minimizing the sum of a regu- 
larizer R(w) and a loss function L(w, x\, x m ) over the coefficient vector w. If the exp (— R(w)) 
and exp (— L(w, x±, ...,x m )) correspond to normalizeable density models, the minimizing solution 
of the objective function can be seen as the Maximum A Posteriori (MAP) estimate of the pos- 
terior p (w\xi, ...,x m ) oc p(w) ■ p(xi, x m \w) — exp (— R(w)) ■ exp (— L(w, x x , x m )). In that 
sense, the regularizer naturally corresponds to the prior and the loss function corresponds to the 
likelihood. Very often, regularizers are specified as a norm over the coefficient vector w which 
in turn correspond to certain priors. For example, in Ridge regression [Hoerl, 1962] the coeffi- 
cients are regularized via ||u>||| which corresponds to a factorial zero mean Gaussian prior on w. 
The Li-norm ||iu||i in the LASSO estimator [Tibshirani, 1996], again, is equivalent to a factorial 
Laplacian prior on w. Like in these two examples, regularizers often correspond to a factorial 
prior. 

Mixed norm regularizers naturally correspond to L p -nested distributions. Proposition 5 shows 
that there is no factorial prior that corresponds to such a regularizer. In particular, it implies 
that the prior cannot be factorial between groups and coefficients at the same time. This means 
that those regularizers implicitly assume statistical dependencies between the coefficient variables. 
Interestingly, for q = 1 and p = 2 the intuition behind these regularizers is exactly that whole 
groups Ik get switched on at once, but the groups are sparse. The Proposition shows that this 
might not only be due to sparseness but also due to statistical dependencies between the coefficients 
within one group. The L p -nested distribution which implements independence between groups will 
be further discussed below as a generalization of the p-generalized Normal (see Section 7) . Note 
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that the marginals can be independent if the regularizer is of the form Ei=i H^/tHp- However, in 
this case p = q and the L p -nested function collapses to a simple L p -norm which means that the 
regularizer is not mixed norm. 



5. Estimation of and Sampling from L p -Nested Symmetric Distributions 

5.1. Maximum Likelihood Estimation. In this section, we describe procedures for maximum 
likelihood fitting of L p -nested symmetric distributions on data. We provide a toolbox online for 
fitting Lp-spherically symmetric and L p -nested symmetric distributions to data. The toolbox can 
be downloaded at http://www.kyb.tuebingen.mpg.de/bethge/code/. 

Depending on which parameters are to be estimated the complexity of fitting an L p -nested 
symmetric distribution varies. We start with the simplest case and later continue with more 
complex ones. Throughout this subsection, we assume that the model has the form p(x) = 
p(Wx) ■ | det W\ = f{w ^Zfl j{1)) ■ | det W\ where W € IR" xn is a complete whitening matrix. This 
means that given any whitening matrix Wo, the freedom in fitting W is to estimate an orthonormal 
matrix Q £ SO(n) such that W — QWq. This is analogous to the case of elliptically contoured 
distributions where the distributions can be endowed with 2nd-order correlations via W. In the 
following, we ignore the determinant of W since that data points can always be rescaled such that 
detW= 1. 

The simplest case is to fit the parameters of the radial distribution when the tree structure, the 
values of the pi and W are fixed. Due to the special form of L p -nested symmetric distributions (5) 
it then suffices to carry out maximum likelihood estimation on the radial component only, which 
renders maximum likelihood estimation efficient and robust. This is because the only remaining 
parameters are the parameters "d of the radial distribution and, therefore, 

argmax 1? logp(Wa;|i?) = argmax^ (— log Sf(f(Wx)) + log g(f(Wx)\ , &)) 

— argmaxfl log g(f(Wx)\&). 

In a slightly more complex case, when only the tree structure and W are fixed, the values of the 
pi, I £ 1 and # can be jointly estimated via gradient ascent on the log-likelihood. The gradient 
for a single data point x with respect to the vector p that holds all pi for all I £ I is given by 

d (n — 1) 

V p log p(Wx) = - log g(f(Wx)) ■ V p f(Wx) ~ ^_iv p /(W») - V p log 5/(1). 

For i.i.d. data points Xi the joint gradient is given by the sum over the gradients for the single 
data points. Each of them involves the gradient of / as well as the gradient of the log-surface area 
of IL f with respect to p, which can be computed via the recursive equations 
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where \&[t] = 4? logT[t] denotes the digamma function. When performing the gradient ascent one 
needs to set as a lower bound for p. Note that, in general, this optimization might be a highly 
non-convex problem. 

On the next level of complexity, only the tree structure is fixed and W can be estimated along 
with the other parameters by joint optimization of the log-likclihood with respect to p, i? and W. 
Certainly, this optimization problem is also not convex in general. Usually, it is numerically more 
robust to whiten the data first with some whitening matrix Wq and perform a gradient ascent 
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on the special orthogonal group SO{n) with respect to Q for optimizing W — QW - Given the 
gradient log p(Wx) of the log-likelihood the optimization can be carried out by performing 
line searches along geodesies as proposed by Edelman et al. [1999] (see also Absil et al. [2007]) or 
by projecting Vw log p(Wx) on the tangent space TwSO{n)) and performing a line search along 
SO(n) in that direction as proposed by Manton [2002]. 

The general form of the gradient to be used in such an optimization scheme can be defined as 

Vw log p(Wx) 
=V W (-(n - 1) • log f{Wx) + log g(f(Wx))) 

= ii^y ' v - f {Wx) ■ xT + dXo ^ 1 {I{Wx)) ■ Vyf {Wx) ■ xT - 

where the derivatives of / with respect to y are defined by recursive equations 

if i I 

sgny, lfvi ik = \yi\ 

«}~ PJ - "ft -1 forie/,fc. 

Note, that / might not be diffcrcntiable at y — 0. However, we can always define a sub-derivative 
at zero, which is zero for pi ^ 1 and [—1,1] for pj = 1. Again, the gradient for i.i.d. data points 
Xi is given by the sum over the single gradients. 

Finally, the question arises whether it is possible to estimate the tree structure from data as 
well. So far, we were not able to come up with an efficient algorithm to do so. A simple heuristic 
would be to start with a very large tree, e.g. a full binary tree, and to prune out inner nodes 
for which the parents and the children have sufficiently similar values for their pi. The intuition 
behind this is that if they were exactly equal, they would cancel in the L p -nested function. This 
heuristic is certainly sub-optimal. Firstly, the optimization will be time consuming since there can 
be about as many pj as there are leaves in the L p -nested tree (a full binary tree on n dimensions 
will have n — 1 inner nodes) and the repeated optimization after the pruning steps. Secondly, the 
heuristic does not cover all possible trees on n leaves. For example, if two leaves are separated by 
the root node in the original full binary tree, there is no way to prune out inner nodes such that 
the path between those two nodes will not contain the root node anymore. 

The computational complexity for the estimation of all other parameters despite the tree struc- 
ture is difficult to assess in general because they depend, for example, on the particular radial 
distribution used. While the maximum likelihood estimation of a simple log-Normal distribution 
only involves the computation of a mean and a variance which are in 0(m) for m data points, a 
mixture of log-Normal distributions already requires an EM algorithm which is computationally 
more expensive. Additionally, the time it takes to optimize the likelihood depends on the starting 
point as well as the convergence rate and we neither have results about the convergence rate nor 
is it possible to make problem independent statements about a good initialization of the param- 
eters. For this reason we only state the computational complexity of single steps involved in the 
optimization. 

Computation of the gradient V p log p(Wx) involves the derivative of the radial distribution, 
the computation of the gradients V p f(Wx) and V p 5/(1). Assuming that the derivative of the 
radial distribution can be computed in 0(1) for each single data point, the costly steps are the 
other two gradients. Computing V p f(Wx) basically involves visiting each node of the tree once 
and performing a constant number of operations for the local derivatives. Since every inner node 
in an L p -nested tree must have at least two children, the worst case would be a full binary tree 
which has 2n — 1 nodes and leaves. Therefore, the gradient can be computed in 0(nm) for m 
data points. For similar reasons, f(Wx), V p log 5/(1) and the evaluation of the likelihood can 
also be computed in 0(nm). This means that each step in the optimization of p can be done 
0(nm) plus the computational costs for the line search in the gradient ascent. When optimizing 
for W = QWo as well, the computational costs per step increase to 0(n 3 + n 2 m) since m data 
points have to be multiplied with W at each iteration (requiring 0(n 2 m) steps) and the line search 
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involves projecting Q back onto SO{n) which requires an inverse matrix square root or a similar 
computation in 0(n 3 ). 

For comparison, each step of fast ICA [Hyvarinen and Oja, 1997] for a complete demixing 
matrix takes 0(n 2 m) when using hierarchical orthogonalization and 0(n 2 m + n 3 ) for symmet- 
ric orthogonalization. The same applies to fitting an ISA model [Hyvarinen and Hoyer, 2000; 
Hyvarinen and Koster, 2006, 2007]. A Gaussian Scale Mixture (GSM) model does not need to 
estimate another orthogonal rotation Q because it belongs to the class of spherically symmetric 
distributions and is, therefore, invariant under transformations from SO(n) [Wainwright and Si- 
moncclli, 2000]. Therefore, fitting a GSM corresponds to estimating the parameters of the scale 
distribution which is in (D(nm) in the best case but might be costlier depending on the choice of 
the scale distribution. 

5.2. Sampling. In this section, we derive a sampling scheme for arbitrary L p -nested symmetric 
distributions which can for example be used for solving integrals when using Lp-nested symmet- 
ric distributions for Bayesian learning. Exact sampling from an arbitrary L p -nested symmetric 
distribution is in fact straightforward due to the following observation: Since the radial and the 
uniform component are independent, normalizing a sample from any L p -nested distribution to 
/-length one yields samples from the uniform distribution on the L p -nested unit sphere. By mul- 
tiplying those uniform samples with new samples from another radial distribution, one obtains 
samples from another L p -nested distribution. Therefore, for each L p -nested function / a single 
L p -nested distribution which can be easily sampled from is enough. Sampling from all other Lp- 
nested distributions with respect to / is then straightforward due to the method we just described. 
Gupta and Song [1997] sample from the p-generalized Normal distribution since it has indepen- 
dent marginals which makes sampling straightforward. Due to Proposition 5, no such factorial 
L p -nested distribution exists. Therefore, a sampling scheme like for Lp-spherically symmetric dis- 
tributions is not applicable. Instead we choose to sample from the uniform distribution inside 
the L p -nested unit ball for which we already computed the radial distribution in Example 5. The 
distribution has the form p(x) — y^jy- In order to sample from that distribution, we will first 
only consider the uniform distribution in the positive quadrant of the unit Lp-nested ball which 
has the form p(x) = ■ Samples from the uniform distributions inside the whole ball can be 
obtained by multiplying each coordinate of a sample with independent samples from the uniform 
distribution over {—1, 1}. 

The idea of the sampling scheme for the uniform distribution inside the L p -nested unit ball is 
based on the computation of the volume of the L p -nested unit ball in Proposition 2. The basic 
mechanism underlying the sampling scheme below is to apply the steps of the proof backwards, 
which is based on the following idea: The volume of the L p -unit ball can be computed by computing 
its volume on the positive quadrant only and multiplying the result with 2™ afterwards. The key 
is now to not transform the whole integral into radial and uniform coordinates at once, but 
successively upwards in the tree. We will demonstrate this through a little example which also 
should make the sampling scheme below more intuitive. Consider the L p -nested function 



we first transform X2 and x% into radial and uniform coordinates only. According to Proposition 
1 the determinant of the mapping {x2,x$) H> (vi,u) = (|| a^2:3 > f^2:3 /|| a?2;3 ||y>i ) is given by Vi(l — 




In order to solve the integral 





i-pi 
pi 



. Therefore the integral transforms into 



I, 



{x:f(x)<l & xeIR™} 



dx = 



{vi ,x\ :f(xi ,i>i)<l Sz xi,t;iGlR-(-} 



// 



1— PI 

pi dxidvidu. 



Lp-NESTED SYMMETRIC DISTRIBUTIONS 



17 



Algorithm 5.1 Exact sampling algorithm for L p -nested distributions 

Input: The radial distribution g of an L p -nested distribution p for the L p -nested function /. 
Output: Sample x from p. 

Algorithm 

(1) Sample v$ from a beta distribution (3 [n, 1]. 

(2) For each inner node I of the tree associated with / sample the auxiliary variable si from a 

where ni k are the number of leaves in the subtree 



Dirichlet distribution Dir 



ni,i n i,f-j 
Pi ' '"' Pi 

under node /, fc. Obtain coordinates on the L p -nested sphere within the positive orthant 

i 

by sj i — y sp — ui (the exponentiation is taken component-wise). 

(3) Transform these samples to Cartesian coordinates by vz ■ uz = vz t \j [ for each inner node, 
starting from the root node and descending to lower layers. The components of Vz i : £j 
constitute the radii for the layer direct below them. If / = 0, the radius had been sampled 
in step 1. 

(4) Once the two previous steps have been repeated until no inner node is left, we have a 
sample x from the uniform distribution in the positive quadrant. Normalize x to get a 
uniform sample from the sphere u = j^y- 

(5) Sample a new radius from the radial distribution of the target radial distribution g and 
obtain the sample via x = 5g • u. 

(6) Multiply each entry n of x by an independent sample Zi from the uniform distribution 
over {—1, 1}. 
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Now we can separate the integrals over x\ and v\, and the integral over u since the boundary of 
the outer integral does only depend on v± and not on u: 

dx = / (1 — u Pl ) pi du ■ / / Vidxidvi. 

{x:f(x)<l & XGIR^} J J {v 1 ,x 1 :f(x 1 ,v 1 )<l & xi,«iGlR+} J 

The value of the first integral is known explicitly since the integrand equals the uniform distribution 
on the || • Hpj-unit sphere. Therefore the value of the integral must be its normalization constant 
which we can get using Proposition 2. 

(1 - np^'^rdu 

An alternative way to arrive at this result is to use the transformation s — ii Pl and to notice that 
the integrand is a Dirichlet distribution with parameters on = ^ . The normalization constant of 
the Dirichlet distribution and the constants from the determinant Jacobian of the transformation 
yield the same result. 

In order to compute the remaining integral, the same method can be applied again yielding 
the volume of the L p -nested unit ball. The important part for the sampling scheme, however, is 
not the volume itself but the fact that the intermediate results in this integration process equal 
certain distributions. As shown in Example 5 the radial distribution of the uniform distribution 
on the unit ball is /3 [n, 1], and as just indicated by the example above the intermediate results 
can be seen as transformed variables from a Dirichlet distribution. This fact holds true even for 
more complex L p -nested unit balls although the parameters of the Dirichlet distribution can be 
slightly different. Reversing the steps leads us to the following sampling scheme. First, we sample 
from the /3-distribution which gives us the radius v$ on the root node. Then we sample from the 
appropriate Dirichlet distribution and exponentiate the samples by ^- which transforms them into 
the analogs of the variable u from above. Scaling the result with the sample v% yields the values 
of the root node's children, i.e. the analogs of x\ and V\. Those are the new radii for the levels 
below them where we simply repeat this procedure with the appropriate Dirichlet distributions 
and exponents. The single steps are summarized in Algorithm 5.1. 
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The computational complexity of the sampling scheme is 0(n). Since the sampling procedure 
is like expanding the tree node by node starting with the root, the number of inner nodes and 
leaves is the total number of samples that have to be drawn from Dirichlet distributions. Every 
node in an L p -nested tree must at least have two children. Therefore, the maximal number of 
inner nodes and leaves is 2n — 1 for a full binary tree. Since sampling from a Dirichlet distribution 
is also in 0(n) the total computational complexity for one sample is in 0{n). 

6. Robust Bayesian Inference of the Location 

For Lp-spherically symmetric distributions with a location and a scale parameter p(x\fi, r) = 
T n p{\\T(x — £t)|| p ) Osiewalski and Steel [1993] derived the posterior in closed form using a prior 
p(//,r) = p(p) ■ c ■ t , and showed that p(x,n) does not depend on the radial distribution 
g, i.e. the particular type of L p -spherically symmetric distributions used for a fixed p. The 
prior on r corresponds to an improper Jeffrey's prior which is used to represent lack of prior 
knowledge on the scale. The main implication of their result is that Bayesian inference of the 
location \x under that prior on the scale does not depend on the particular type of L p -spherically 
symmetric distribution used for inference. This means that under the assumption of an L p - 
spherically symmetric distributed variable, for a fixed p, one does have to know the exact form of 
the distribution in order to compute the location parameter. 

It is straightforward to generalize their result to L p -nested symmetric distributions and, hence, 
making it applicable to a larger class of distributions. Note that when using any L p -nested 
symmetric distribution, introducing a scale and a location via the transformation x t-¥ t[x — \i) 
introduces a factor of r" in front of the distribution. 

Proposition 6. For fixed values P0,Pi, ... and two independent priors p(/i.,r) = p(/x) • cr" 1 of the 
location p, and the scale r where the prior on r is an improper Jeffrey 's prior, the joint distribution 
p(x, fi) is given by 

p(x, fi) = f(x - ii)~ n ■ c • ^ • 

where Z denotes the normalization constant of the L p -nested uniform distribution. 

Proof. Given any L p -nested symmetric distribution p(f(x)) the transformation into the polar-like 
coordinates yields the following relation 

[p(f(x))dx= f f l[G L (u z )r n - 1 p(r)drdu= f J{G L {u z )dvi- ( r n - x p{r)dr. 

Since Y[Lec^ L ( u L) ^ s ^ ne unnormalized uniform distribution on the L p -nested unit sphere, the 
integral must equal the normalization constant that we denote with Z for brevity (see Proposition 
3 for an explicit expression). This implies that p has to fulfill 

1 



1 



Z 



= / r n - L p{r)d 



r. 



Writing down the joint distribution of a?,/x and r, and using the substitution s = rf(x — fx) we 
obtain 



p(x, fi) = J T n p(f(T(x - /Lt))) • ct 1 • p{p)dr 
= j S n - 1 p(s)-c-p( t j,)f(x- f i)- n ds 
= f{x -fi)- n - c-i -p(/i). 



□ 



Note that this result could easily be extended to ^-spherical distributions. However, in this 
case the normalization constant Z cannot be computed for most cases and, therefore, the posterior 
would not be known explicitly. 
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7. Relations to ICA, ISA and Over-Complete Linear Models 

In this section, we explain the relations among Lp-spherically symmetric, Lp-nested symmetric, 
ICA and ISA models. For a general overview see Figure 4. 

The density model underlying ICA models the joint distribution of the signal a; as a linear 
superposition of statistically independent hidden sources Ay = x or y — Wx. If the marginals of 
the hidden sources are belong to the exponential power family we obtain the p-generalized Normal 
which is a subset of the Lp-spherically symmetric class. The p-generalized Normal distribution 
p(y) oc exp(— r|jy||p is a density model that is often used in ICA algorithms for kurtotic natural 
signals like images and sound by optimizing a dcmixing matrix W w.r.t. to the model p(y) oc 
exp(-r||V^£c||p [Lee and Lewicki, 2000; Lewicki, 2002; Zhang et al., 2004]. It can be shown 
that the p-generalized Normal is the only factorial model in the class of Lp-spherically symmetric 
models [Sinz et al., 2009a], and, by Proposition 5, also the only factorial Lp-nested symmetric 
distribution. 




FIGURE 4. Relations between the different classes of distributions: For arbitrary 
distributions on the subspaces ISA (blue) is a superclass of ICA (green). Obvi- 
ously, Lp-nested symmetric distributions (red) are a superclass of Lp-spherically 
symmetric distributions (yellow). L p -nested ISA models live in the intersection 
of Lp-nested symmetric distributions and ISA models (intersection of red and 
blue). Those L p -nested ISA models that are Lp-spherically symmetric are also 
ICA models (intersection of green and yellow) . This is the class of p-generalized 
Normal distributions. If p is fixed to two, one obtains the spherically symmetric 
distributions (pink). The only class of distributions in the intersection between 
spherically symmetric distributions and ICA models is the Gaussian (intersection 
green, yellow and pink). 

An important generalization of ICA is the Independent Subspace Analysis (ISA) proposed 
by Hyvarinen and Hoyer [2000] and by Hyvarinen and Koster [2007] who used Lp-spherically 
symmetric distributions to model the single subspaces. Like in ICA, also ISA models the hidden 
sources of the signal as a product of multivariate distributions: 

K 

p{y) = \\p{yi k )- 

k=l 



20 



FABIAN SINZ AND MATTHIAS BETHGE 



Here, y = Wx and Ik are index sets selecting the different subspaces from the responses of W to 
x. The collection of index sets Ik forms a partition of 1, n. ICA is a special case of ISA in which 
Ik = {k} such that all subspaces are one-dimensional. For the ISA models used by Hyvarinen et al. 
the distribution on the subspaces was chosen to be either spherically or Lp-spherically symmetric. 

ICA and ISA have been used to infer features from natural signals, in particular from natural 
images. However, as mentioned by several authors [Simoncelli, 1997; Wainwright and Simoncelli, 
2000; Zetzsche et al., 1993] and demonstrated quantitatively by Bethge [2006] and Eichhorn et al. 
[2008], the assumptions underlying linear ICA are not well matched by the statistics of natural 
images. Although the marginals can be well described by an exponential power family, the joint 
distribution cannot be factorized with linear filters W. 

A reliable parametric way to assess how well the independence assumption is met by a signal at 
hand is to fit a more general class of distributions that contains factorial as well as non-factorial 
distributions which both can equally well reproduce the marginals. By comparing the likelihood 
on held out test data between the best fitting non-factorial and the best-fitting factorial case, one 
can asses how well the sources can be described by a factorial distribution. For natural images, 
for example, one can use an arbitrary L p -spherically symmetric distribution p(||Wa;||p), fit it to 
the whitened data and compare its likelihood on held out test data to the one of the p-generalized 
Normal [Sinz and Bethge, 2009]. Since any choice of radial distribution g determines a particular 
Lp-spherically symmetric distribution the idea is to explore the space between factorial and non- 
factorial models by using a very flexible density g on the radius. Note that having an explicit 
expression of the normalization constant allows for particularly reliable model comparisons via the 
likelihood. For many graphical models, for instance, such an explicit and computable expression 
is often not available. 




Figure 5. Tree corresponding to an L p -nested ISA model. 



The same type of dependency-analysis can be carried out for ISA using L p -nested symmetric 
distributions [Sinz et al., 2009b]. Figure 5 shows the L p -nested tree corresponding to an ISA with 
four subspaces. For general such trees, each inner node — except the root node — corresponds to a 
single subspace. When using the radial distribution 
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the subspaces v 1 ,...,v e<l become independent and one obtains an ISA model of the form 
P (y) = - exp ^ — 

= i ex /_Eti^lk 



exp (_ E4lMJk 



n 









pi 



Put 

which has Lp-spherically symmetric distributions on each subspace. Note that this radial distribu- 
tion is equivalent to a Gamma distribution whose variables have been raised to the power of ^- . In 
the following we will denote distributions of this type with j p (u, s) , where u and s are the shape 
and scale parameter of the Gamma distribution, respectively. The particular 7 P distribution that 
results in independent subspaces has arbitrary scale but shape parameter u = j*-. When using 
any other radial distribution, the different subspaces do not factorize and the distribution is also 
not an ISA model. In that sense Lp-nested symmetric distributions are a generalization of ISA. 
Note, however, that not every ISA model is also L p -nested symmetric since not every product of 
arbitrary distributions on the subspaces, even if they are L p -spherically symmetric, must also be 
.Lp-nested. 

It is natural to ask, whether Lp-ncstcd symmetric distributions can serve as a prior distri- 
bution p{y\&) over hidden factors in over-complete linear models of the form p(x\W,o~,'d) = 
J p(x\Wy, a)p(y\&)dy, where p(x\Wy) represents the likelihood of the observed data point x given 
the hidden factors y and the over-complete matrix W. For example, p(x\Wy, a) = Af(Wy, a ■ I) 
could be a Gaussian like in Olshausen and Field [1996]. Unfortunately, such a model would suffer 
from the same problems as all over-complete linear models: While sampling from the prior is 
straightforward sampling from the posterior p(y\x, W, i?, a) is difficult because a whole subspace 
of y leads to the same x. Since parameter estimation either involves solving the high-dimensional 
integral p(x \ W, a, #) = J p(x\Wy, a)p{y\'d)dy or sampling from the posterior, learning is compu- 
tationally demanding in such models. Various methods have been proposed to learn W, ranging 
from sampling the posterior only at its maximum [Olshausen and Field, 1996], approximating the 
posterior with a Gaussian via the Laplace approximation [Lewicki and Olshausen, 1999] or using 
Expectation Propagation [Seeger, 2008]. In particular, all of the above studies either do not fit 
hypcr-paramctcrs i? for the prior [Lewicki and Olshausen, 1999; Olshausen and Field, 1996] or 
rely on the factorial structure of it [Seeger, 2008]. Since L p -nested distributions do not provide 
such a factorial prior, Expectation Propagation is not directly applicable. An approximation like 
in Lewicki and Olshausen [1999] might be possible, but additionally estimating the parameters 
i? of the Lp-nested symmetric distribution adds another level of complexity in the estimation 
procedure. Exploring such over-complete linear models with a non-factorial prior may be an in- 
teresting direction to investigate, but it will need a significant amount of additional numerical and 
algorithmical work to find an efficient and robust estimation procedure. 



8. Nested Radial Factorization with L p -Nested Symmetric Distributions 

Lp-nested symmetric distribution also give rise to a non-linear ICA algorithm for linearly mixed 
non- factorial Lp-nested hidden sources y. The idea is similar to the Radial Factorization algorithms 
proposed by Lyu and Simoncelli [2009] and Sinz and Bethge [2009]. For this reason, we call it 
Nested Radial Factorization (NRF). For a one layer L p -nested tree, NRF is equivalent to Radial 
Factorization as described in Sinz and Bethge [2009]. If additionally p is set to p = 2, one obtains 
the Radial Gaussianization by Lyu and Simoncelli [2009]. Therefore, NRF is a generalization of 
Radial Factorization. It has been demonstrated that Radial Factorization algorithms outperform 
linear ICA on natural image patches [Lyu and Simoncelli, 2009; Sinz and Bethge, 2009]. Since 
-Lp-nested symmetric distributions are slightly better in likelihood on natural image patches [Sinz 
et al., 2009b] and since the difference in the average log- likelihood directly corresponds to the 
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reduction in dependencies between the single variables [Sinz and Bethge, 2009], NRF will slightly 
outperform Radial Factorization on natural images. For other type of data the performance will 
depend on how well the hidden sources can be modeled by a linear superposition of — possibly 
non-independent — L p -nested symmetrically distributed sources. Here we state the algorithm as a 
possible application of L p -nested symmetric distributions for unsupervised learning. 

The idea is based on the observation that the choice of the radial distribution g already deter- 
mines the type of L p -nested symmetric distribution. This also means that by changing the radial 
distribution by remapping the data, the distribution could possibly be turned in a factorial one. 
Radial Factorization algorithms fit an L p -spherically symmetric distribution with a very flexible 
radial distribution to the data and map this radial distribution g s (s for source) into the one of a 
p-generalized Normal distribution by the mapping 



(13) 



y i-> 



WvWp 



y- 



where J 7 ^ and T s are the cumulative distribution functions of the two radial distributions involved. 
The mapping basically normalizes the demixed source y and rescales it with a new radius that 
has the correct distribution. 

Exactly the same method cannot work for L p -nested symmetric distributions since Proposition 
5 states that there is no factorial distribution we could map the data to by merely changing the 
radial distribution. Instead we have to remap the data in an iterative fashion beginning with 
changing the radial distribution at the root node into the radial distribution of the L p -nested ISA 
shown in equation (12). Once the nodes are independent, we repeat this procedure for each of 
the child nodes independently, then for their child nodes and so on, until only leaves are left. The 
rescaling of the radii is a non-linear mapping since the transform in equation (13) is non-linear. 
Therefore, NRF is a non-linear ICA algorithm. 
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Figure 6. L p -nested non-linear ICA for the tree of Example 6: For an arbitrary 
Lp-nested symmetric distribution, using equation (13), the radial distribution can 
be remapped such that the children of the root node become independent. This is 
indicated in the plot via dotted lines. Once the data has been rescaled with that 
mapping, the children of root node can be separated. The remaining subtrees are 
again L p -nested symmetric and have a particular radial distribution that can be 
remapped into the same one that makes their root nodes' children independent. 
This procedure is repeated until only leaves are left. 



We demonstrate this at a simple example. 
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Example 6. Consider the function 

f(y) = M2/1P + [ I2/2P' 2 + (|2/ 3 P' 2 + M P2 * 2 )^ 



p 0,2 \ P0 2 



for y = Wx where W as been estimated by fitting an L p -nested symmetric distribution with a 
flexible radial distribution to Wx as described in Section 5.1. Assume that the data has already 
been transformed once with the mapping of equation (13). This means that the current radial 
distribution is given by (12) where we chose s = 1 for convenience. This yields a distribution of 
the form 



p(y) = 



U 



exp -m m - bp* 2 + (bp- 2 + bp' 2 )* 2 ' 2 
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Now we can separate the distribution of y\ from the distribution over y%, ■••,2/4- The distribution 
of ?/i is a p-generalized Normal 



P(j/i) = 



p(y2,-,yi) 
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By using equation (10) we can identify the new radial distribution to be 

/ x ( pA 

e\ v ®a) = r „- .i ex P \- y $,2 ) ■ 



r 



"a, 2 



Replacing this distribution by the one for the p-generalized Normal (for data we would use the 
mapping in equation (13)), we obtain 



P(V2 



[" "0,: 
[pus 



p 0,2 



exp -|y 2 p> 2 -(|y 3 p' 2 + |y4p' 2 ) P2 ' 2 



r 


ni 
Vi _ 




ntir 


ni.k 
Pi 



iex\0 

Now, we can separate out the distribution of yi which is again p-generalized Normal. This leaves 
us with the distribution for y 3 and y4 



p(2/3,2/4) = 
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For this distribution we can repeat the same procedure which will also yield p-generalized Normal 
distributions for y 3 and 2/4. 
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Algorithm 8.1 Recursion NRF(y, /, g s ) 

Input: Data point y, L p -nested function /, current radial distribution g s 
Output: Non-linear ly transformed data point y 

Algorithm 

(1) Set the target radial distribution to be £>_ll 7 P 

(2) Set y <— — ^ly)^^ ' V where T denotes the cumulative distribution function the re- 
spective Q. 

(3) For all children i of the root node that are not leaves: 




(a) Set g s j p 



1 1 2 



Pa ' r i 55 

LP0 J 



(b) Set yg _j <~ NRF(y0 i; /g ^, Note that in the recursion 0, i will become the new 
(4) Return y 



This non-linear procedure naturally carries over to arbitrary L p -nested trees and distributions, 
thus yielding a general non-linear ICA algorithm for linearly mixed non-factorial L p -nested sources. 
For generalizing Example 6, note the particular form of the radial distributions involved. As al- 
ready noted above the distribution (12) on the root node's values that makes its children statistical 
independent is that of a Gamma distributed variable with shape parameter ^ and scale parame- 
ter s which has been raised to the power of — . In Section 7 we denoted this class of distributions 
with 7 P [u,s], where u and s are the shape and the scale parameter, respectively. Interestingly, 
the radial distributions of the root node's children are also 7 P except that the shape parameter is 
r ^-- The goal of the radial remapping of the children's values is hence just changing the shape 
parameter from to ^i>. Of course, it is also possible to change the scale parameter of the 
single distributions during the radial remappings. This will not affect the statistical independence 
of the resulting variables. In the general algorithm, that we describe now, we choose s such that 
the transformed data is white. 

The algorithm starts with fitting a general L p -nested model of the form p(Wx) as described in 
Section 5.1. Once this is done, the linear demixing matrix W is fixed and the hidden non-factorial 
sources are recovered via y = Wx. Afterwards, the sources y are non-linearly made independent 
by calling the recursion specified in Algorithm 8.1 with the parameters Wx, f and g, where g is 
the radial distribution of the estimated model. 

The computational complexity for transforming a single data point is 0(n 2 ) because of the 
matrix multiplication Wx. In the non-linear transformation, each single data dimension is not 
rescaled more that n times which means that the rescaling is certainly also in 0(n 2 ). 

An important aspect of NRF is that it yields a probabilistic model for the transformed data. 
This model is simply a product of n independent exponential power marginals. Since the radial 
remappings do not change the likelihood, the likelihood of the non-linearly separated data is the 
same as the likelihood of the data under L p -nested symmetric distribution that was fitted to it in 
the first place. However, in some cases, one might like to fit a different distribution to the outcome 
of Algorithm 8.1. In that case the determinant of the transformation is necessary to determine 
the likelihood of the input data — and not the transformed one — under the model. The following 
lemma provides the determinant of the Jacobian for the non-linear rescaling. 

Lemma 2 (Determinant of the Jacobian). Let z = NRF(Wx, /, g s ) as described above. Let ti 
denote the value ofWx below the inner node I which have been transformed with Algorithm 8.1 
up to node 1. Let gi{r) = (J- gAL ° Fg 3 )(r) denote the radial transform at node I in Algorithm 8.1. 
Furthermore, let I denote the set of all inner nodes, excluding the leaves. Then, the determinant 
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of the Jacobian ( J is given by 



(14) 



det|* 
dxj 



| det W\ ■ Yl 



lei 



gdfiit!))"'- 1 Qs(fi(ti)) 



/j(tz)^- 1 QAL(9l(fl(tl))) 

Proof. The proof can be found in the Appendix E. □ 

9. Conclusion 

In this article we presented a formal treatment of the first tractable subclass of j/-spherical 
distributions which generalizes the important family of Lp-spherically symmetric distributions. We 
derived an analytical expression for the normalization constant, introduced a coordinate system 
particularly tailored to L p -nested functions and computed the determinant of the Jacobian for 
the corresponding coordinate transformation. Using these results, we introduced the uniform 
distribution on the Lp-nested unit sphere and the general form of an Lp-nested distribution for 
arbitrary L p -nested functions and radial distributions. We also derived an expression for the joint 
distribution of inner nodes of an L p -nested tree and derived a sampling scheme for an arbitrary 
Lp-nested distribution. 

Lp-nested symmetric distributions naturally provide the class of probability distributions corre- 
sponding to mixed norm priors, allowing full Bayesian inference in the corresponding probabilistic 
models. We showed that a robustness result for Bayesian inference of the location parameter 
known for L p -spherically symmetric distributions carries over to the L p -nested symmetric class. 
We discussed the relations of Lp-nested symmetric distributions to Indepedent Component (ICA) 
and Independent Subspace Analysis (ISA), and discussed its applicability as a prior distribution 
in over-complete linear models. Finally, we showed how L p -nested distributions can be used to 
construct a non-linear ICA algorithm called Nested Radial Factorization (NRF). 

The application of Lp-nested symmetric distribution has been presented in a previous conference 
paper [Sinz et al., 2009b]. Code for training this class of distribution is provided online under 
http : //www . kyb . tuebingen . mpg . de/bethge/code/. 
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Appendix A. Determinant of the Jacobian 

Lemma 1. The proof is very similar to the one in Song and Gupta [1997]. To derive equation (2) 
one needs to expand the Jacobian of the inverse coordinate transformation with respect to the 
last column using the Laplace expansion of the determinant. The term A n can be factored out of 
the determinant and cancels due to the absolute value around it. Therefore, the determinant of 
the coordinate transformation does not depend on A n . 

The partial derivatives of the inverse coordinate transformation are given by: 

d 

a — Xi = fiikr for 1 < i, k < n - 1 
ou k 

- — x n = A„r— — for 1 < k < n — 1 
d 

T—Xi — Ui for 1 < i < n — 1 
or 

d_ 
or 
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Therefore, the structure of the Jacobian is given by 

/ r 



J = 







V A„. 



U n -l 
A„M„ J 



Since we are only interested in the absolute value of the determinant and since A„ G { — 1, 1}, we 
can factor out A n and drop it. Furthermore, we can factor out r from the first n — 1 columns 
which yields 



detj| =r n - 
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du n - 



Ux 



Now we can use the Laplace expansion of the determinant with respect to the last column. For 
that purpose, let Ji denote the matrix which is obtained by deleting the last column and the ith 
row from J . This matrix has the following structure 





Ji 
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dui 
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du n - 
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We can transform Ji into a lower triangular matrix by moving the column with all zeros and 
bottom entry to the rightmost column of Ji. Each swapping of two columns introduces a factor of 
— 1. In the end, we can compute the value of det Ji by simply taking the product of the diagonal 
entries and obtain dct Ji = (-l) n_1_4 |^- This yields 



| det J] 
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du k 



□ 



Before proving Proposition 1 stating that the determinant only depends on the terms Gi(uj) 
produced by the chain rule when used upwards in the tree, let us quickly outline the essential 
mechanism when taking the chain rule for : Consider the tree corresponding to /. By definition 
u n is the rightmost leaf of the tree. Let L, £l be the multi-index of u n - As in the example, the 
chain rule starts at the leaf u n ascends in the the tree until it reaches the lowest node whose subtree 
contains both, u n and u q . At this point, it starts descending the tree until it reaches the leaf u q . 
Depending on whether the chain rule ascends or descends, two different forms of derivatives occur: 
while ascending, the chain rule produces G/(ttj-)-terms like the one in the example above. At 
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descending, it produces i*j(uj)-terms. The general definitions of the G[(uj)- and F/(u/)-terms 
are given by the recursive formulae 



fi,i I -pi 
pi 



(15) <W«*-s) = giMvfzY"-' 1 - 1 " = [ ai(n T r - £ /^K,)* 

3=1 
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fe=l 



The next two lemmata are required for the proof of Proposition 1. We use the somewhat sloppy 
notation k £ I,i r if the variable Uk is a leaf in the subtree below 7, i r . The same notation is used 
for I. 

Lemma 3. Let I = «i,...,i r _i and I,i r be any node of the tree associated with an L p -nested 
function/. Then the following recursions hold for the derivatives of gj ! i r (Uj-~-) Pl - i ^ and f\ 1 i (uj i i r ) 
w.r.t u q : If u q is not in the subtree under the node I,i r , k (jL I,i r , then 

^iUKvr = 

du q 
and 



for q £ I, j and q (l I,k for k ^ j . Otherwise 



BfuimjY" ifq£l,j 



A p ( U ^)P^ = o and A/ ( u )P/ = J^j^ ( U ( u 
for q £ I, i r , s and q (jL I, i r , k for k ^ s. 

Proof. Both of the first equations are obvious, since only those nodes have a non-zero derivative for 
which the subtree actually depends on u q . The second equations can be seen by direct computation 
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Similarly 
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for fee/, i r , s. □ 

The next lemma states the form of the whole derivative in terms of the Gi(uj)- and 
f>(u/)-terms. 

Lemma 4. Let = w^ ll ...,f mj i ll ... l i t , |u„| = «£ 1 ,.../ d wii/i m < d. The derivative of u n w.r.t. u q 

is given by 
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Proof. Successive application of Lemma (3). □ 

Proposition 1. Before we begin with the proof, note that Fj(uj) and Gj(u~) fulfill following 
equalities 

ti-i 



(17) = 9i(u f ) Pl - £ Fi,k(ui,k)fi,k(*i,kY 



\pi,k 

± 1 ,«V"i ,K ) J 1 ,KV"J ,K 

k=l 

and 



(18) fi,i n fa,ij pi ' hn = Fl,i m ,k( u i,i m ,k)fl,i m ,k( u i,i m ,k) Pl - : 

k=l 



Now let L = £i, ...,£d-i be the multi- index of the parent of u n . We compute ^=r \ det J\ and 
obtain the result by solving for | det J\. As shown in Lemma (1) „_ ± | det J\ has the form 

^Idet^l = . Wfc+U „. 

A;— 1 

By definition u n = .9L.£ d (ufr-) = gLi d {u-rir)' PLAd ■ Now, assume that u m , u n _i are children of 
L, i.e. Ufc = WL,/,j t for some /, it = ii, ...,it and m < k < n. Remember, that by Lemma (4) the 

9 
f?it„ 



terms u q -^u n for to < q < n have the form 



d 

u q —u n = -Gl^Uj-j-) ■ F Liil (u LAl ) ■ ... ■ F LJ (u LJ ) ■ \u |^i.-.«d-i.u.-.. t -i. 
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Using equation (17), we can expand the determinant as follows 

du„ 
du k 



E^+^>o^ 

k=l 



m — 1 r\ 71—1 a 

k—1 k—m 



m — 1 q / n — 1 



0^- k -u k + G L/d (u^) - ^ G L j d (uj^) ^- 

A;— 1 \ k—m 

m— 1 / n— 1 £ d — 1 

= _ E ^ ' Ufc + GL ^ (M iX ) (~ E G ^( M L^) _1 £; •^ + 5i(u £ ) Pi - J] f LiKi)/L,fc( U L,t) Pl ' 
fc— 1 \ k—m k—1 

Note that all terms Gl ij{u-r~7-)~ 1 Tr n - ■ u k for m < k < n now have the form 

d 
aitfe 

since we constructed them to be neighbors of u n . However, with equation (18), we can further 
expand the sum Y^k=i FL,k{ VL L,k)fL,k{^L,k) PL - k down to the leaves u m , u n -i. When doing 
so we end up with the same factors FL,ii(uL, tl ) • ... • Flj(ulj) • \uq\ ptl, - ,td - 1 ' il, -- tt - 1 as in the 
derivatives Ghj d (u-[^) x Uq-^Uri- This means exactly that 

n—\ o Id— 1 



k—m k—1 



and, therefore, 

m— 1 / n— 1 ^ 1 

= _ E ^' Uk + GL ^ iu LTJ (~E Gi '^ (u £^ )_1 ^ -«* + fli( u £) M - £ ^K,fe)/i,fcK,fe) 

A:— 1 \ k—m k—1 

m-1 q /<d-l <<»-l 

= - E ^ •«* + G ^K^) E ^,feK,fe)/i,feK,fe) P£ '' t +5l(u£) p£ - E ^,feK,fe)/i,feK,fc) ? 
fe=i fc Vfe=i fe=i 

m — 1 
fc=l 

By factoring out GL,i d {Uj^-j~) from the equation, the terms loose the Gl^ in front and 

we get basically the same equation as before, only that the new leaf (the new "u„") is gL(iij-) PL 
and we got rid of all the children of L. By repeating that procedure up to the root node, we 
successively factor out all Gli(ujt) for V G C until all terms of the sum vanish and we are only 
left with U0 = 1. Therefore, the determinant is 

^|det^|=n G ^ w £) 

Lec 

which completes the proof. □ 

Appendix B. Volume and Surface of the L p -Nested Unit Sphere 

Proposition 2. We obtain the volume by computing the integral Jf^^^dx. Differentiation with 
respect to R yields the surface area. For symmetry reasons we can compute the volume only on 
the positive quadrant IR" and multiply the result with 2™ later to obtain the full volume and 
surface area. The strategy for computing the volume is as follows. We start off with inner nodes / 
that are parents of leaves only. The value vj of such a node is simply the L PI norm of its children. 
Therefore, we can convert the integral over the children of I with the transformation of Gupta 
and Song [1997]. This maps the leaves Vj i-tj into vj and "angular" variables u. Since integral 



PL, 
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borders of the original integral depend only on the value of Vj and not on u, we can separate the 
variables u from the radial variables vj and integrate the variables u separately. The integration 
over u yields a certain factor, while the variable vi effectively becomes a new leaf. 

Now suppose I is the parent of leaves only. Without loss of generality let the £/ leaves correspond 
to the last li coefficients of x. Let x G IR!?. Carrying out the first transformation and integration 
yields 



/ dx = / vj 1 1 1 - V" SfM dvidudx 1:n ^i r 

Jf{x)<R Jf( Xl:n - ei , VI )<RJi 1 .£V e + I 1 \ i=1 / 

i l 

= / vF^dvtdxm-t, x / f 1 - £ uA 



du. 



For solving the second integral we make the pointwise transformation Si — u^ 1 and obtain 



1 - E 5 *' 



du 



Pi 



I>i<i 
1,-1 



n 



p/ fe=l 



fe=l 
-1 



Pi 



fc=i 



pi 

k_ 1_ 

Pi' Pi 



Pi 



by using the fact that the transformed integral has the form of an unnormalized Dirichlct distri- 
bution and, therefore, the value of the integral must equal its normalization constant. 
Now, we solve the integral 

(19) / v™'~ 1 dvidx 1 .. n -i I . 

We carry this out in exactly the same manner as we solved the previous integral. We only need to 
make sure that we only contract nodes that have only leaves as children (remember that radii of 
contracted nodes become leaves) and we need to find a formula how the factors v™ 1 " propagate 
through the tree. 

For the latter, we first state the formula and then prove it via induction. For notational 
convenience let J denote the set of multi-indices corresponding to the contracted leaves, x ^ the 
remaining coefficients of x and v j the vector of leaves resulting from contraction. The integral 
which is left to solve after integrating over all u is given by (remember that nj denotes real leaves, 
i.e. the ones corresponding to coefficients of a;): 



n 



~ l dv jdx j. 



We already proved the first induction step by computing equation (19). For computing the general 
induction step suppose / is an inner node whose children are leaves or contracted leaves. Let J' 
be the set of contracted leaves under / and K — J\J' . Transforming the children of / into radial 
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f n vy-^vjdxf = f f n w 1 ) ■ ( n *y ,_1 ) 



31 



f(x£,V)c,vi)<R Jui I - 1 eV + I 



P7 



V 



i=l 



n 



fj-i 

i=l 



pj <j- 1 

j[J («/"/c)" fe_1 J dx£dv K dvidu ei -i 



An 



>K 



\KEK 



X V, 



f-i+E'iiK-i) 



1 - ^ Mj J • J Wfe fe _1 | dxfcdvicdvrfui^ 

v i=l / fe=l 



f{x^,v K ,vi)<R 



\KeK ) 



tj -PI 



X 



Again, by transforming it into a Dirichlet distribution, the latter integral has the solution 

"It -pj 



Ju 4j _iev + J \ i=1 / fe=1 fc=1 



Pi 



Pi 



while the remaining former integral has the form 



Jf(<B R ,VK,v r )<R \ifeJC / J f(Kj,vj)<R j^j 

as claimed. 

By carrying out the integration up to the root node the remaining integral becomes 

fR nn 



v t <R 



«0 x d«0 = / 1 dv ffl = 







y aw 



Collecting the factors from integration over the u proves the equations (6) and (8). Using B [a, b] = 
^Sjr yields equations (7) and (9). □ 



Proposition 4- 



p(x) 



Appendix C. Layer Marginals 



Sf(f(x)) 

o(f(x-\-„—e..Vr.Ui.—-\.A„)) i, 



1,-1 



<--pi 
pi 



:S2 
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where A„ = sign(x„). Note that / is invariant to the actual value of A„. However, when 
integrating it out, it yields a factor of 2. Integrating out ut^i and A„ now yields 

p( Xl:n - il ,v I )= Sf{f(x)) 

= Q{f{x\:n-e. n Vl)) _ ? 



2 i IT i 1 


j_ 






Pi T 




-l 





Now, we can go on an integrate out more subtrees. For that purpose, let x ^ denote the remaining 
coefficients of cc, vj the vector of leaves resulting from the kind of contraction just shown for vi 
and J the set of multi-indices corresponding to the "new leaves" , i.e the node i>j after contraction. 
We obtain the following equation 

Q{f{Xj,Vj)) 



S f (f(x?,vj)) 



IT 



where nj denotes the number of leaves in the subtree under the node J. The calculations for the 
proof are basically the same as the one for proposition (2). □ 

Appendix D. Factorial L p -Nested Distributions 

Proposition 5. Since the single Xi are independent, fx (x± ),..., ft^ipt^) and, therefore, Ui,...,u^ 
must be independent as well (xi are the elements of x in the subtree below the ith child of the 
root node). Using Corollary 2 we can write the density of ■■■,vi lj as (the function name g is 
unrelated to the usage of the function g above) 



P( v l:h) = Y[ h tM = ^(11^1:^0 lly»0 ) XT '' 



i=l 



i=l 



with 



S(IK:<«IL) = 





n 

P® _ 






rih 
P0 _ 



IIP0/ 



Since the integral over g is finite, it follows from Sinz et al. [2009a] that g has the form <?(||i>i :J 
exp(et0||'Ui : £ e II™ + b$) for appropriate constants a® and b$. Therefore, the marginals have the form 

(20) hi(vi) = exp(a uf + eg)?;™ 1-1 . 

On the other hand, the particular form of g implies that the radial density has the form 
g(f(x)) oc /(a;)*™ -1 ) exp(a0/(a;) P0 +£>0) Pa . In particular, this implies that the root node's children 
fi(xi) (i — 1, ^0) are independent and L p -nested again. With the same argument as above, 
it follows that their children Vi.\.j i follow the distribution p(vi t ±, Uj^) = exp(aj||'Uj ) i : ^ i ||^* + 
bi) Ylj=i v?]' 3 ■ Transforming that distribution to i p -spherically symmetric polar coordinates 
v % = llfi.i^tllpi and u = w«,i^ 4 — i/||vi,i:£ 4 \\ Pi as in Gupta and Song [1997], we obtain the form 



1-p 



p(vi,u) = exp(aiwf + bi)v\ 



-1 



m-i 




n 



exp(a i vf i + bi)v 



where the second equation follows the same calculations as in the proof of 2. After integrating out 
u, assuming that the Xi are statistically independent, we obtain the density of Vi which is equal to 
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(20) if and only if pi = pg. However, if p$ and Pi are equal, the hierarchy of the L p -nested function 
shrinks by one layer since pi and p§ cancel themselves. Repeated application of the above argument 
collapses the complete L p -nested tree until one effectively obtains an L p -spherical function. Since 
the only factorial L p -spherically symmetric distribution is the p-generalized Normal [Sinz et al., 
2009a] the claim follows. □ 

Appendix E. Determinant of the Jacobian for NRF 

Lemma 2. The proof is a generalization of the proof of Lyu and Simoncelli [2009]. Due to the 
chain rule the Jacobian of the entire transformation is the multiplication of the Jacobians for each 
single step, i.e. the rescaling of a subset of the dimensions for one single inner node. The Jacobian 
for the other dimensions is simply the identity matrix. Therefore, the determinant of the Jacobian 
for each single step is the determinant for the radial transformation on the respective dimensions. 
We show how to compute the determinant for a single step. 

Assume that we reached a particular node I in Algorithm 8.1. The leaves, which have been 
rescaled by the preceding steps, are called tj. Let = fff^jy 1 *i w hh gi(r) = (J-]^ ° J- S )(r). 
The general form of a single Jacobian is 

dil = , d / g j(/i(t f )) \ 

Oh 11 ' dtj \ hitj) ) + fj(tj) in " 

where 

A f gi(fi(ti)) \ = ( g'Afijti)) gi(fi(ti)) \ d_ 

dtj { fj(tj) ) \ fj(tj) fj(tjy J dtj JI[ Ih 

Let yi be a leave in the subtree under / and let I, J±, Jk be the path of inner nodes from / 
to yi, then 

± fl ( tl ) = ^- p sr PJl . ... . ^- i-pj * \ y r^ ■ Bgny,. 

If we denote r = fi(tr) and Ci = PJl • ■•■ 1 v k' k l P ' k \Vi\ VJk ~ 1 ' s S n Ui f° r the respective Jk, 
we obtain 

det (t, • A (mm) + = dot (Uo - ^ r-*z C T + ^ 



at, V //(*/) / //(*/) 

Now we can use Sylvester's determinant formula det(/„ + 6t/C T ) = det(l + btj C) = 1 + btJC 



or equivalently 



det(a/„ + &t/C ' ) = det I a • ( I n + -t^ 1 



1 det ( J„ + -i/C 1 
a 



a 



n-l 



(a + btjC), 



as well as tjC = fi{ti) Pl = r Pl to see that 



det ( (rf( r ) - ^M) r -~i, • ^ + = ^ det ( (^) - ^1) r-tj . C + ^ 

= ^r-^(r). 

|:5x(r) is readily computed via J:Si(r) = J:^! 1 ° 7" a )(r) = e ^} r) ) ■ 

Multiplying the single determinants along with det for the final step of the chain rule com- 
pletes the proof. □ 
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